
    Soundscape visualization: a new approach based on automatic annotation and Samocharts

    Get PDF
    The visualization of sounds facilitates their identification and classification. On audio recording websites, however, access to a sound is usually based on its metadata, i.e. sources and recording conditions. Since sonic environments, or soundscapes, are mostly composed of multiple sources, describing them compactly is difficult, which complicates the choice of an item in a sound corpus. The time-component matrix chart (TM-chart) was recently proposed as a tool to describe and compare sonic environments, but it relies on subjective annotation, which makes its creation time-consuming. In this paper, we present a new method for urban soundscape corpus visualization. In the context of the CIESS project, we propose the Samochart, an extension of the TM-chart based on sound detection algorithms. We describe three original algorithms that detect alarms, footsteps and motors; Samocharts can then be computed from their results. The method is applied to a concrete case study: 20 urban recordings of 5 minutes each, taken in different situations (places and times). An application case shows that Samocharts allow the identification of these different situations. Overall, the method provides a low-cost tool for soundscape visualization that can easily be applied to the management and use of a sound corpus.
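    The paper does not detail how Samocharts are computed from the detector outputs; purely as an illustration, a time-component matrix of presence ratios could be derived from hypothetical binary detection streams as in the sketch below (all function and parameter names are assumptions):

        # Illustrative sketch only: build a Samochart-like time-component matrix
        # from hypothetical per-frame detector outputs. `detections` maps a
        # component name (e.g. "alarm") to a binary array (1 = detected in frame).
        import numpy as np

        def samochart_matrix(detections, frame_rate=10, window_s=30):
            """Return (components, matrix) where matrix[i, j] is the fraction of
            frames in time window j where component i was detected."""
            components = sorted(detections)
            n_frames = min(len(detections[c]) for c in components)
            win = int(frame_rate * window_s)          # frames per time window
            n_win = n_frames // win
            mat = np.zeros((len(components), n_win))
            for i, c in enumerate(components):
                frames = np.asarray(detections[c][:n_win * win]).reshape(n_win, win)
                mat[i] = frames.mean(axis=1)          # presence ratio per window
            return components, mat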

    Indexation sonore : recherche de composantes primaires pour une structuration audiovisuelle

    Get PDF
    Thesis committee: Paul DELÉGLISE, Professor at Université du Maine (reviewer); Patrick GROS, Research Scientist at IRISA Rennes (reviewer); Daniel DOURS, Professor at Université Toulouse III (jury president); Jean CARRIVE, Research Engineer at the Institut National de l'Audiovisuel (member); Dominique FOHR, Research Scientist at LORIA Nancy (member).
    Processing the growing quantity of available audiovisual information quickly and intelligently requires robust, automatic tools. This work addresses the indexing and structuring of the soundtrack of multimedia documents, with the goal of detecting primary components: speech, music and key sounds (jingles, characteristic sounds, keywords). For speech/music classification, three unusual parameters are extracted: entropy modulation, stationary segment duration (obtained with a Forward-Backward Divergence segmentation) and the number of segments per second. These three parameters are fused with the classical 4 Hz modulation energy. Experiments on radio corpora show the robustness of these parameters, with a correct classification rate above 90%; the system is then compared with, and fused with, a classical system based on Gaussian Mixture Models (GMM) and cepstral analysis. Another partitioning consists in detecting pertinent key sounds. For jingles, candidates are selected by comparing the "signature" of each jingle with the data stream; the system is simple to implement yet fast and efficient: on an audiovisual corpus of about ten hours (roughly 200 jingles), no false alarm occurred and only two jingles were missed, under extreme conditions. Characteristic sounds (applause and laughter) are modelled with GMMs in the spectral domain, and a TV corpus validates this study with encouraging results. Keyword detection is carried out in a standard way: the aim here is not to improve existing systems but to serve the structuring task, since these keywords indicate the programme type (news, weather, documentary...). Building on these primary components, two studies explore how to recover a temporal structure of audiovisual documents: the first detects a recurring production invariant across collections of studio programmes, while the second structures TV news broadcasts into topics. Finally, some directions on the contribution of video analysis are discussed and future needs are explored.
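    As a side note, the classical 4 Hz modulation energy feature mentioned above is commonly computed along the following lines (a minimal sketch, not the thesis implementation; frame length and filter design are assumptions):

        # Hedged sketch of a 4 Hz modulation energy feature: the short-term
        # energy envelope is band-pass filtered around 4 Hz, the typical
        # syllabic rate of speech, and its relative energy is returned.
        import numpy as np
        from scipy.signal import butter, sosfiltfilt

        def modulation_energy_4hz(signal, sr, frame_s=0.02):
            hop = int(sr * frame_s)
            n = len(signal) // hop
            env = np.array([np.sum(signal[i*hop:(i+1)*hop] ** 2) for i in range(n)])
            env_sr = 1.0 / frame_s                    # envelope sampling rate (50 Hz)
            sos = butter(2, [3.0, 5.0], btype="bandpass", fs=env_sr, output="sos")
            mod = sosfiltfilt(sos, env - env.mean())  # keep ~3-5 Hz modulations
            return np.sum(mod ** 2) / (np.sum(env ** 2) + 1e-12)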

    CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding

    Full text link
    Automated Audio Captioning (AAC) involves generating natural language descriptions of audio content using encoder-decoder architectures: an audio encoder produces audio embeddings that are fed to a decoder, usually a Transformer decoder, for caption generation. In this work, we describe our model, whose novelty, compared to existing models, lies in the use of a ConvNeXt architecture as audio encoder, adapted from the vision domain to audio classification. This model, called CNext-trans, achieved state-of-the-art scores on the AudioCaps (AC) dataset and performed competitively on Clotho (CL), while using four to forty times fewer parameters than existing models. We examine potential biases in the AC dataset, which originates from AudioSet, by investigating the impact of an unbiased encoder on performance; using the well-known PANN CNN14 as an unbiased encoder, for instance, we observed a 1.7% absolute reduction in SPIDEr score (higher scores indicate better performance). To improve cross-dataset performance, we conducted experiments combining multiple AAC datasets (AC, CL, MACS, WavCaps) for training. Although this strategy enhanced overall model performance across datasets, it still fell short of models trained specifically on a single target dataset, indicating the absence of a one-size-fits-all model. To mitigate performance gaps between datasets, we introduced a Task Embedding (TE) token, allowing the model to identify the source dataset of each input sample. We provide insights into the impact of these TEs on both the form (words) and content (sound event types) of the generated captions. The resulting model, named CoNeTTE, an unbiased CNext-trans model enriched with dataset-specific Task Embeddings, achieved SPIDEr scores of 44.1% and 30.5% on AC and CL, respectively. Code available: https://github.com/Labbeti/conette-audio-captioning
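    The official CoNeTTE code is available at the URL above; the snippet below is only a schematic illustration of the Task Embedding idea, with assumed layer sizes and a simplified decoder (causal masking omitted for brevity):

        # Schematic sketch: a dataset-specific Task Embedding prepended to the
        # decoder inputs, so the model knows which caption style to produce.
        import torch
        import torch.nn as nn

        class CaptionDecoderWithTE(nn.Module):
            def __init__(self, vocab_size, d_model=256, n_datasets=4):
                super().__init__()
                self.tok_emb = nn.Embedding(vocab_size, d_model)
                self.task_emb = nn.Embedding(n_datasets, d_model)   # one TE per dataset
                layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
                self.decoder = nn.TransformerDecoder(layer, num_layers=3)
                self.out = nn.Linear(d_model, vocab_size)

            def forward(self, tokens, audio_embs, dataset_id):
                x = self.tok_emb(tokens)                         # (B, T, d)
                te = self.task_emb(dataset_id).unsqueeze(1)      # (B, 1, d)
                x = torch.cat([te, x], dim=1)                    # prepend TE token
                h = self.decoder(tgt=x, memory=audio_embs)       # audio_embs: (B, S, d)
                return self.out(h[:, 1:])                        # drop the TE position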

    Unsupervised Speech Unit Discovery Using K-means and Neural Networks

    Get PDF
    Unsupervised discovery of sub-lexical units in speech is a problem that currently interests speech researchers. In this paper, we report experiments in which we use phone segmentation followed by clustering of the segments using k-means and a Convolutional Neural Network. We thus obtain an annotation of the corpus in pseudo-phones, which then allows us to find pseudo-words. We compare the results for two different segmentations, manual and automatic, and, to check the portability of our approach, for three different languages (English, French and Xitsonga). The originality of our work lies in the use of neural networks in an unsupervised way, which differs from the common auto-encoder-based approach to unsupervised speech unit discovery. On the Xitsonga corpus, for instance, we obtained phone-level purity scores of 46% and 42% with manual and automatic segmentation, respectively, using 30 pseudo-phones. Based on the inferred pseudo-phones, we discovered about 200 pseudo-words.
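    As an illustration of the clustering step (not the authors' exact pipeline; the feature choice and shapes are assumptions), pseudo-phone labels could be obtained as follows:

        # Rough sketch: each speech segment is reduced to a fixed-size vector
        # (here, mean MFCCs) and clustered into a small set of pseudo-phones.
        import numpy as np
        from sklearn.cluster import KMeans

        def pseudo_phone_labels(segment_features, n_units=30, seed=0):
            """segment_features: list of (n_frames, n_mfcc) arrays, one per segment."""
            X = np.stack([f.mean(axis=0) for f in segment_features])  # one vector per segment
            km = KMeans(n_clusters=n_units, n_init=10, random_state=seed).fit(X)
            return km.labels_            # pseudo-phone index for each segment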

    Multitask learning in Audio Captioning: a sentence embedding regression loss acts as a regularizer

    Full text link
    In this work, we study the performance of a model trained with an additional sentence embedding regression loss for the Automated Audio Captioning task. This task aims to build systems that can describe audio content with a single sentence written in natural language. Most systems are trained with the standard Cross-Entropy loss, which does not take into account the semantic closeness of sentences. We found that adding a sentence embedding loss term not only reduces overfitting but also increases SPIDEr from 0.397 to 0.418 in our first setting on the AudioCaps corpus. When we increased the weight decay value, our model came much closer to the current state-of-the-art methods, with a SPIDEr score of up to 0.444 compared to 0.475, while using eight times fewer trainable parameters. In this training setting, the sentence embedding loss no longer has an impact on model performance.
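    A minimal sketch of the combined objective described above, assuming a token-level cross-entropy plus an MSE regression between predicted and reference sentence embeddings (the weighting and the embedding model are assumptions, not the paper's exact formulation):

        import torch
        import torch.nn.functional as F

        def captioning_loss(logits, target_tokens, pred_sent_emb, ref_sent_emb, lam=0.5):
            # logits: (B, T, V); target_tokens: (B, T); sentence embeddings: (B, D)
            ce = F.cross_entropy(logits.transpose(1, 2), target_tokens)   # token-level term
            emb = F.mse_loss(pred_sent_emb, ref_sent_emb)                 # sentence-level term
            return ce + lam * emb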

    Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval?

    Full text link
    Automated Audio Captioning (AAC) aims to develop systems capable of describing an audio recording with a textual sentence. In contrast, Audio-Text Retrieval (ATR) systems seek to find the best matching audio recording(s) for a given textual query (Text-to-Audio) or vice versa (Audio-to-Text). These tasks usually require different types of systems: AAC employs a sequence-to-sequence model, while ATR uses a ranking model that compares audio and text representations within a shared projection subspace. In this work, however, we investigate the relationship between AAC and ATR by exploring the ATR capabilities of an unmodified AAC system, without fine-tuning it for the new task. Our AAC system consists of an audio encoder (ConvNeXt-Tiny) trained on AudioSet for audio tagging, and a Transformer decoder responsible for generating sentences. For AAC, it achieves high SPIDEr-FL scores of 0.298 on Clotho and 0.472 on AudioCaps on average. For ATR, we propose using the standard Cross-Entropy loss values obtained for each audio/caption pair. Experimental results on the Clotho and AudioCaps datasets show decent recall values with this simple approach; for instance, we obtained a Text-to-Audio R@1 of 0.382 on AudioCaps, which is above the current state-of-the-art method without external data. Interestingly, we observe that normalizing the loss values was necessary for Audio-to-Text retrieval.
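    A minimal sketch of the scoring idea, assuming a captioning model that returns token logits given caption tokens and audio embeddings (the model interface is hypothetical, not the paper's code):

        # Text-to-Audio ranking with an unmodified captioning model: each
        # candidate audio is scored by the cross-entropy the model assigns to
        # the query caption given that audio; lower loss = better match.
        import torch
        import torch.nn.functional as F

        @torch.no_grad()
        def text_to_audio_scores(model, caption_tokens, audio_embs_list):
            """caption_tokens: (1, T) tensor; audio_embs_list: list of (1, S, d) tensors."""
            scores = []
            for audio_embs in audio_embs_list:
                logits = model(caption_tokens[:, :-1], audio_embs)        # teacher forcing
                ce = F.cross_entropy(logits.transpose(1, 2), caption_tokens[:, 1:])
                scores.append(-ce.item())                                 # higher = better
            # (for Audio-to-Text, the abstract notes that loss values need
            #  normalising across candidate captions before ranking)
            return scores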

    Reconnaissance de sons d'eau pour l'indexation en activités de la vie quotidienne

    Get PDF
    The assessment of impairments in activities of daily living is now used in the diagnosis of dementia, but it suffers from a lack of objective tools. To address this gap, the IMMED project proposes recording videos at the patient's home and automatically indexing these videos into activities. The indexed videos let specialists watch patients carrying out activities in their usual environment. In this context, many everyday tasks involve water: washing one's hands, doing the dishes, brushing one's teeth, and so on. In this paper, we present two methods for detecting water sounds for automatic segmentation into activities. The first, based on acoustic descriptors, detects water flow. To recognise other types of water sounds, such as drops, we also present an original approach that relies on acoustic models of liquid sounds.

    Water sound recognition based on physical models

    Get PDF
    This article describes an audio signal processing algorithm to detect water sounds, developed in the context of a larger system aiming to monitor the daily activities of elderly people. While previous proposals for water sound recognition relied on classical machine learning and generic audio features to characterize water sounds as a flow texture, we describe here a recognition system based on a physical model of air bubble acoustics. This system is able to recognize a wide variety of water sounds and does not require training. It is validated on a home environmental sound corpus with a classification task, in which all water sounds are correctly detected. In a free detection task on a real-life recording, it outperformed the classical systems and obtained an F-measure of 70%.
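    The abstract does not name the physical model; a standard starting point for air-bubble acoustics is the Minnaert resonance, sketched below to relate a detected spectral peak to a plausible bubble radius (constants assume air bubbles in water at atmospheric pressure):

        import math

        def minnaert_frequency(radius_m, gamma=1.4, p0=101325.0, rho=998.0):
            """Resonance frequency (Hz) of an air bubble of given radius in water."""
            return math.sqrt(3 * gamma * p0 / rho) / (2 * math.pi * radius_m)

        def bubble_radius_from_peak(freq_hz, gamma=1.4, p0=101325.0, rho=998.0):
            """Inverse mapping: spectral peak frequency (Hz) -> bubble radius (m)."""
            return math.sqrt(3 * gamma * p0 / rho) / (2 * math.pi * freq_hz)

        # e.g. a 1 mm bubble resonates around 3.3 kHz:
        # minnaert_frequency(1e-3)  ->  ~3287 Hz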

    Two-step detection of water sound events for the diagnostic and monitoring of dementia

    Get PDF
    A significant aging of the world population is foreseen for the next decades, so technologies that support the independence of the elderly and assist them are of growing interest. In this framework, the IMMED project investigates tele-monitoring technologies to support doctors in the diagnosis and follow-up of dementia illnesses such as Alzheimer's disease. In particular, water sounds are very useful for tracking and identifying abnormal behaviors in everyday activities (e.g. hygiene, household chores, cooking). In this work, we propose a two-stage system to detect this type of sound event. In the first stage, the audio stream is segmented with a simple but effective algorithm based on the Spectral Cover feature. The second stage improves the system's precision by classifying the segmented streams into water/non-water sound events using Gammatone Cepstral Coefficients and Support Vector Machines. Experimental results reveal the potential of the combined system, yielding an F-measure higher than 80%.
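    A minimal sketch of the second (classification) stage only, assuming Gammatone cepstral coefficient vectors have already been extracted per segment (feature extraction and hyperparameters are assumptions, not the paper's setup):

        # Water / non-water classification of candidate segments with an SVM.
        from sklearn.svm import SVC
        from sklearn.preprocessing import StandardScaler
        from sklearn.pipeline import make_pipeline

        def train_water_classifier(gtcc_train, labels):
            """gtcc_train: (n_segments, n_coeffs) array; labels: 1 = water, 0 = other."""
            clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
            clf.fit(gtcc_train, labels)
            return clf

        # Usage on candidate segments produced by the first (Spectral Cover) stage:
        # predictions = train_water_classifier(gtcc_train, labels).predict(gtcc_candidates)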